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Abstract 

We accelerate the computation of spherical harmonic transforms, using what is known as the butterfly scheme. This 
provides a convenient alternative to the approach taken in the second paper from this series on "Fast algorithms for 
spherical harmonic expansions." The requisite precomputations become manageable when organized as a "depth-first 
traversal" of the program's control-flow graph, rather than as the perhaps more natural "breadth-first traversal" that 
processes one-by-one each level of the multilevel procedure. We illustrate the results via several numerical examples. 
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1. Introduction 

The butterfly algorithm, introduced in d and is a procedure for rapidly applying certain matrices to ar- 
bitrary vectors. (Section[3]below provides a brief introduction to the butterfly.) The present paper uses the butterfly 
method in order to accelerate spherical harmonic transforms. The butterfly procedure does not require the use of 
extended-precision arithmetic in order to attain accuracy very close to the machine precision, not even in its precom- 
putations — unlike the alternative approach taken in the predecessor liTsIl of the present paper 

Unlike some previous works on the butterfly, the present article does not use on-the-fly evaluation of individual 
entries of the matrices whose applications to vectors are being accelerated. Instead, we require only efficient evaluation 
of full columns of the matrices, in order to make the precomputations affordable. Furthermore, efficient evaluation of 
full columns enables the acceleration of the application to vectors of both the matrices and their transposes. On-the-fly 
evaluation of columns of the matrices associated with spherical harmonic transforms is available via the three-term 
recurrence relations satisfied by associated Legendre functions (see, for example, Section|5]below). 

The precomputations for the butterfly become affordable when organized as a "depth-first traversal" of the pro- 
gram's control-flow graph, rather than as the perhaps more natural "breadth-first traversal" that processes one-by-one 
each level of the multilevel butterfly procedure (see Section|4]below). 

The present article is supposed to complement [11] and |15], combining ideas from both. Although the present 
paper is self-contained in principle, we strongly encourage the reader to begin with ifllll and ifTsll . The original is ifioll . 
Major recent developments are in [4] and [17]. The introduction in ilSll summarizes most prior work on computing 
fast spherical harmonic transforms; a new application appears in [12']. These articles and their references highlight 
the computational use of spherical harmonic transforms in meteorology and quantum chemistry. The structure of the 
remainder of the present article is as follows: Section|2]reviews elementary facts about spherical harmonic transforms. 
Section[3]describes basic tools from previous works. Section |4]organizes the preprocessing for the butterfly to make 
memory requirements affordable. Section |5] outlines the application of the butterfly scheme to the computation of 
spherical harmonic transforms. Section |6] describes the results of several numerical tests. Section |7] draws some 
conclusions. 

Throughout, we abbreviate "interpolative decomposition" to "ID" (see Subsection l3.1l for a description of the ID). 
The butterfly procedures formulated in ifloll . | IJ |, and the present paper all use the ID for efficiency. 
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2. An overview of spherical harmonic transforms 



The spherical harmonic expansion of a bandhmited function / on the surface of the sphere has the form 

21-1 k 



m H""(cos(0)) e""^ (1) 

k=0 m=-k 



where (6, (p) are the standard spherical coordinates on the two-dimensional surface of the unit sphere in K^, G (0, tt) 
and ip e (0, Ijt), and Pj, is the normalized associated Legendre function of degree k and order \m\ (see, for example, 
Subsection l3.3l for the definition of normalized associated Legendre functions). Please note that the superscript m in 
yS™ denotes an index, rather than a power "Normalized" refers to the fact that the normalized associated Legendre 
functions of a fixed order are orthonormal on (-1,1) with respect to the standard inner product. Obviously, the 
expansion ([T]l contains 41' terms. The complexity of the function / determines I. 

In many areas of scientific computing, particularly those using spectral methods for the numerical solution of 
partial diff'erential equations, we need to evaluate the coefficients in an expansion of the form ([TJ for a function / 
given by a table of its values at a collection of appropriately chosen nodes on the two-dimensional surface of the unit 
sphere. Conversely, given the coefficients yS™ in ([TJ, we often need to evaluate / at a collection of points on the surface 
of the sphere. The former is known as the forward spherical harmonic transform, and the latter is known as the inverse 
spherical harmonic transform. A standard discretization of the surface of the sphere is the "tensor product," consisting 
of all pairs of the form (Ok, (fj), with cos(6'()), cos(0i), . . . , 005(621-2), cos(02/-i) being the Gauss-Legendre quadrature 
nodes of degree 21, that is, 

- 1 < cos(6lo) < cos(0i) < . . . < COS(02/-2) < cos(6'2/-i) < 1 (2) 

and 

]^,(cos(0*)) = (3) 
for k -Q, 1, . . . , 2/ - 2, 2/ - 1, and with (fo, ipi, . . . , (/?4/_3, (P41-2 being equispaced on the interval (0, In), that is. 



2;r(. 



for j = 0, 1, . . . , 4/-3, 4Z-2. This leads immediately to numerical schemes for both the forward and inverse spherical 
harmonic transforms whose costs are proportional to P . 

Indeed, given a function / defined on the two-dimensional surface of the unit sphere by ([T]i, we can rewrite ([TJ in 
the form 

21-1 21-1 

m^)^ e""^Yj/5l'P^"\cos{e)). (5) 

m=-2l+\ k=\m\ 

For a fixed value of 9, each of the sums over A; in ((5j contains no more than 21 terms, and there are 4Z - 1 such sums 
(one for each value of m); since the inverse spherical harmonic transform involves 21 values 6q,9\, . . ., 621-2, S21-1, the 
cost of evaluating all sums over A: in ((5j is proportional to P. Once all sums over k have been evaluated, each sum over 
m may be evaluated for a cost proportional to I (since each of them contains 41-1 terms), and there are (20(4/ - 1) 
such sums to be evaluated (one for each pair (6k, ipj)), leading to costs proportional to P for the evaluation of all sums 
over m in ([5j. The cost of the evaluation of the whole inverse spherical harmonic transform (in the form ([5j) is the sum 
of the costs for the sums over k and the sums over m, and is also proportional to a virtually identical calculation 
shows that the cost of evaluating of the forward spherical harmonic transform is also proportional to P. 

A trivial modification of the scheme described in the preceding paragraph uses the fast Fourier transform (FFT) 
to evaluate the sums over m in ([5j, approximately halving the operation count of the entire procedure. Several other 
careful considerations (see, for example, iQl and ifisll ) are able to reduce the costs by 50% or so, but there is no simple 
trick for reducing the costs of the whole spherical harmonic transform (either forward or inverse) below P. The 
present paper presents faster (albeit more complicated) algorithms for both forward and inverse spherical harmonic 
transforms. Specifically, the present article provides a fast algorithm for evaluating a sum over k in ((5jat6l = 0o, 6u 
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■ ■ ■ , 021-2, 02i-\, given the coefficients jS|^|,j8™|^j, . . ■,^'21-2^^21-1^ for ^ fixed m. Moreover, the present paper provides a 
fast algorithm for the inverse procedure of determining the coefficients y6j"j|, • • • , /?'2;_2' ^2/-i from the values of a 

sum over k m ^ ?A 9 - 9q, 9i, . . . , 921-2, 02i-i - FFTs or fast discrete sine and cosine transforms can be used to handle 
the sums over m in (|5]i efficiently. See iflill for a detailed summary and novel application of the overall method. The 
present article modifies portions of the method of II12II and OlSll . focusing exclusively on the modifications. 



3. Preliminaries 

In this section, we summarize certain facts from mathematical and numerical analysis, used in Sections |4] and |5] 
Subsection l3.1l describes interpolative decompositions (IDs). Subsection l3.2l outlines the butterfly algorithm. Subsec- 
tion |33] summarizes basic properties of normalized associated Legendre functions. 

3.1. Interpolative decompositions 

In this subsection, we define interpolative decompositions (IDs) and summarize their properties. 
The following lemma states that, for any m x n matrix A of rank k, there exist an m x k matrix A**' whose columns 
constitute a subset of the columns of A, and a kxn matrix A, such that 

1 . some subset of the columns of A makes up the kx k identity matrix, 

2. A is not too large, and 

3. ^|„xA- ' ^kxn — ^mxn- 

Moreover, the lemma provides an approximation 

^mx/l ' ^kxn ~ ^mxn (6) 

when the exact rank of A is greater than k, yet the (^+l)st greatest singular value of A is still small. The lemma 
is a reformulation of Theorem 3.2 in |9] and Theorem 3 in |5]; its proof is based on techniques described in |0], 
lUt], and lfl6ll . We will refer to the approximation in (|6]l of A as an interpolative decomposition (ID). We call A the 
"interpolation matrix" of the ID. 

Lemma 3.1. Suppose that m and n are positive integers, and A is a real m X n matrix. 

Then, for any positive integer k with k < m and k < n, there exist a real kxn matrix A, and a real m X k matrix 
A*^*^ whose columns constitute a subset of the columns of A, such that 

1. some subset of the columns of A makes up the kxk identity matrix, 

2. no entry of A has an absolute value greater than 1, 

3. the spectral norm ( that is, the fi-operator norm) of A satisfies ||Ai:x«||2 ^ ^k{n — k) + \, 

4. the least (that is, the kth greatest) singular value of A is at least 1, 
^- ^^mxk ■ ^kxn - Kixn when k - m or k - n, and 

6. when k < m and k < n, the spectral norm ( that is, the fi-operator norm) o/A[*^^ ■ Akxn — ^mxn satisfies 

■ Aix„ - A,„x„||2 ^ V^(n -k) + l o-k+u (7) 

where (Tk+i is the (k+l)st greatest singular value of A. 

Properties 1, 2, 3, and 4 in Lemma lTTI ensure that the ID A**^^ ■ A of A is a numerically stable representation. Also, 
property 3 follows directly from properties 1 and 2, and property 4 follows directly from property 1 . 

Remark 3.2. Existing algorithms for the computation of the matrices A'*^ and A in Lemma [TTI are computationally 
expensive. We use instead the algorithm of fs] and ^ to produce matrices A**^* and A which satisfy slightly weaker 
conditions than those in Lemma ITTI We compute A**^* and A such that 

1 . some subset of the columns of A makes up the kxk identity matrix, 

2. no entry of A has an absolute value greater than 2, 
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3. the spectral norm (that is, the /^-operator norm) of A satisfies Aj^xn < ^/4k{n -k) + l, 

4. the least (that is, the A;th greatest) singular value of A is at least 1, 
5- '^Txk ■ '^kxn = A,„x„ when ^ = m or ^ = «, and 

6. when k < m and k < n, the spectral norm (that is, the Z^-operator norm) of AJ^^^ ■ A^xn - A,„x„ satisfies 

I l^mxk ■ ^kxn - A„,x„ 1 12 < V4fc(n -k)+l CTk^ 1 , (8) 

where (Tk+i is the (A:+l)st greatest singular value of A. 

For any positive real number s, the algorithm can identify the least k such that \\A^''^ ■ A - A^^ k s. Furthermore, the 
algorithm computes both A**' and A using at most 

CiD = 0{kmn log(«)) (9) 

floating-point operations, typically requiring only 

Ci'o = Oikmn). (10) 

3.2. The butterfly algorithm 

In this subsection, we outline a simple case of the butterfly algorithm from ifioll and see for a detailed 
description. 

Suppose that n is a positive integer, and A is an « x « matrix. Suppose further that e and C are positive real 
numbers, and ^ is a positive integer, such that any contiguous rectangular subblock of A containing at most Cn entries 
can be approximated to precision e by a matrix whose rank is k (using the Frobenius/Hilbert-Schmidt norm to measure 
the accuracy of the approximation); we will refer to this hypothesis as "f/ie rank property^ The running-time of the 
algorithm will be proportional to k^ IC\ taking C to be roughly proportional to k suffices for many matrices of interest 
(including nonequispaced and discrete Fourier transforms), so ideally k should be small. We will say that two matrices 
G and H are equal to precision e, denoted G ~ //, to mean that the spectral norm (that is, the /^-operator norm) of 
G-H is 0(£). 

We now explicitly use the rank property for subblocks of multiple heights, to illustrate the basic structure of the 
butterfly scheme. 

Consider any two adjacent contiguous rectangular subblocks L and R of A, each containing at most Cn entries and 
having the same numbers of rows, with L on the left and R on the right. Due to the rank property, there exist IDs 

L^L^'^'^-L (11) 

and 

R ^ 7?**' ■ R, (12) 

where L^** is a matrix having k columns, which constitute a subset of the columns of L, 7?**' is a matrix having k 
columns, which constitute a subset of the columns of R, L and R are matrices each having k rows, and all entries of L 
and R have absolute values of at most 2. 

To set notation, we concatenate the matrices L and R, and split the columns of the result in half (or approximately 
in half), obtaining T on top and B on the bottom: 



(.3, 



Observe that the matrices T and B each have at most Cn entries (since L and R each have at most Cn entries). 
Similarly, we concatenate the matrices L**^' and and spUt the columns of the result in half (or approximately in 
half), obtaining r*^*) and 

= (14) 
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Observe that the 2k columns of T*^*^ are also columns of T, and that the 2k columns of B*^*^ are also columns of B. 
Due to the rank property, there exist IDs 

and 

where r'*' is a matrix having k columns, which constitute a subset of the columns of r'^*', B^''^ is a matrix having k 
columns, which constitute a subset of the columns of B^^''\ T^*^ and B^^*^* are matrices each having k rows, and all 
entries of T^^*^' and B^^*^* have absolute values of at most 2. 
Combining (fTTTi-(fT6]l yields that 

„, ~ / 7 n \ 

(17) 



and 

B Si B<*^' ■ B(2*) ■ 



(i 







R 







^ 


7? 



(18) 



If we use m to denote the number of rows in L (which is the same as the number of rows in R), then the number 
of columns in L (or R) is at most Cn/m, and so the total number of entries in the matrices in the right-hand sides 
of (fTTl i and (fT2l i can be as large as 2mA: + 2k{Cn/m), whereas the total number of nonzero entries in the matrices in the 
right-hand sides of ( [TtI i and ( fTSl l is at most mk + Ak^ + 2k(Cn/m). If m is nearly as large as possible — nearly n — and 
k and C are much smaller than n, then mk + Ak? + 2k{Cn/m) is about half 2mk + 2k(Cn/m). Thus, the representation 
provided in ( fTTT i and ( fTSl ) of the merged matrix from ( fTlT l is more efficient than that provided in ( fTTT i and ( fT2] ). both 
in terms of the memory required for storage, and in terms of the number of operations required for applications to 
vectors. Notice the advantage of using the rank property for blocks of multiple heights. 

Naturally, we may repeat this process of merging adjacent blocks and splitting in half the columns of the result, 
updating the compressed representations after every split. We start by partitioning A into blocks each dimensioned 
n X [CJ (except possibly for the rightmost block, which may have fewer than [CJ columns), and then repeatedly group 
unprocessed blocks (of whatever dimensions) into disjoint pairs, processing these pairs by merging and splitting them 
into new, unprocessed blocks having fewer rows. The resulting multilevel representation of A allows us to apply A 
with precision s from the left to any column vector, or from the right to any row vector, using just 0((k^/C) n log(n)) 
floating-point operations (there will be (9(log(n)) levels in the scheme, and each level except for the last will only 
involve 0{n / C) interpolation matrices of dimensions k x {2k), such as T^*' and B^*^^). Figure[T]illustrates the resulting 
partitionings of A into blocks of various dimensions (but with every block having the same number of entries), when 
and C = 1 . For further details, see 



Remark 3.3. Needless to say, the same multilevel representation of A permits the rapid application of A both from the 
left to column vectors and from the right to row vectors. There is no need for constructing multilevel representations 
of both A and the transpose of A. 

Remark 3.4. In practice, the IDs used for accurately approximating subblocks of A do not all have the same fixed 
rank k. Instead, for each subblock, we determine the minimal possible rank such that the associated ID still approxi- 
mates the subblock to precision e, and we use this ID in place of one whose rank is k. Determining ranks adaptively 



in this manner accelerates the algorithm substantially. For further details, see 111 111 . All our implementations use this 
adaptation. 



3.3. Normalized associated Legendre functions 

In this subsection, we discuss several classical facts concerningnormalized associated Legendre functions. All of 
these facts follow trivially from results contained, for example, in llj] or il4ll . 
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Figure 1 : The partitionings in the multilevel decomposition for an 8 X 8 matrix with C = 1 
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Level 4 



L indicates the left member of a pair; R indicates the right member. 
T indicates the top member of a pair; B indicates the bottom member. 



For any nonnegative integers / and m such that I > m, we use to denote the normahzed associated Legendre 
function of degree / and order m, defined on (-1, 1) via the formula 



|2/+ 1 (I -my. 2\"l^ d' 

where P/ is the Legendre polynomial of degree /, 



— m Zl+l(l-m)\ i 2\ml d'" 



(see, for example, Chapter 8 of [17]). "Normalized" refers to the fact that the normalized associated Legendre functions 
of a fixed order m are orthonormal on (-1,1) with respect to the standard inner product. If / - m is even, then 
Pi (-X) - Pi (x) for any x e (-1, 1). If / - m is odd, then (-x) = -Pi (x) for any x e (-1, 1). 

The following lemma states that the normalized associated Legendre functions satisfy a certain self-adjoint second- 
order linear (Sturm-Liouville) differential equation. 

Lemma 3.5. Suppose that m is a nonnegative integer. 
Then, 

for any x & (-1,1), and I — m, m + \, m + 2, 

The following lemma states that the normalized associated Legendre function of order m and degree m + 2n has 
exactly n zeros inside (0, 1), and, moreover, that the normalized associated Legendre function of order m and degree 
m + 2n + \ also has exactly n zeros inside (0, 1). 

Lemma 3.6. Suppose that m and n are nonnegative integers with n > 0. 
Then, there exist precisely n real numbers jcq, x\, . .. , x„_2, such that 

< Xo < JCl < . . . < Xn-2 < x„-i < 1 (22) 

and 

K.2n(xj) = (23) 

for j — 0, \, . . . , n — 2, n — \. 

Moreover, there exist precisely n real numbers ya, y\, . .. , y„-2, Jn-i such that 

Kyo <yi < ... < y„^2 < y„-i < 1 (24) 

and 

K.2n.l(yj)-0 (25) 

for j — 0, \, . . . , n - 2, n - \. 
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Suppose that m and n are nonnegative integers with « > 0. Then, we define real numbers po, pi, . . . , Pn-2, Pn-i, 
ctq, (Ti, . . . , o"„_2, cr«-i, and cr„ via the formulae 

2(2m + 4« + l) 

P; = T (26) 

(l-(-;)^)(£^l2„M 

for j = 0, 1, . . . , n - 2, « - 1, where xq, xi, . . . , x„_2, are from (l23T l. 

2(2m + 4« + 3) 

O"/ = — r (27) 

(l-(3'.)')(z^^l2„.i(3',)) 

for j -0, 1, . . . , n - 2, « - 1, where yo,yi, ■ ■■ , yn-2, y«-i are from (l25T l, and 



2m + 4n + 3 

o-„ = T. (28) 

The following lemma describes what are known as Gauss-Jacobi quadrature formulae corresponding to associated 
Legendre functions. 



X 



Lemma 3.7. Suppose that m and n are nonnegative integers with n > 0. 
Then, 

dx (l - x^)™ p(x) ^Y,Pj{^' ^^J^'T P^^j^ ^29) 

for any even polynomial p of degree at most An — 2, where xq, X\, x„_2, are from f l23l ), ant/ po, pi, 
Pn-2, Pn-i defined in \ 
Furthermore, 

rfx (l - x^)'" /7(x) = fr„ ;7(0) + ^ (Ty (l - iyjf)" p(y j) (30) 
' >=o 
for any even polynomial p of degree at most 4n, where yo, yi, . . ., yn~2, Jn-i cire from ( 1251 ). and (Tq, (Ti, . . . , (Tn-i, cr„ 
are defined in l[27}l and ( 1281 ). 



Remark 3.8. Formulae (35) and (36) of OlSll incorrectly omitted the factors (1 - (xy)^)™ and (1 - {yj)^)'" appearing in 
the analogous ( |29] | and ( l30l l above. 

Suppose that m is a nonnegative integer. Then, we define real numbers c„,, c,„+i, c,„+2, ■ ■ ■ and dm, d,„+\, d,„+2, . . . 
via the formulae 



ci = 

for l-m, m+\,m + 2, and 



(l-m + l)(l-m + 2)(l + m+ l)(l + m + 2) 

1 J 



'\ (2Z+ l)(2/ + 3)2(2Z + 5) 



. -2-^-1 (32) 

(2/ - 1)(2/ + 3) 

for I — m, m + I, m + 2, . . . . 

The following lemma states that the normalized associated Legendre functions of a fixed order m satisfy a certain 
three-term recurrence relation. 

Lemma 3.9. Suppose that m is a nonnegative integer 
Then, 

pI'(x) = d, P"\x) + c, Pl2(x) (33) 
for any x e (-1, 1), and I — m or I — m + I, and 

^ pI'(x) = c,_2 p7-2(x) + di P'"(x) + c, Pl2(x) (34) 

for any x e (-1, 1), and I - m + 2, m + 3, m + 4, . . . , where Cm, Cm+i, Cm+2, ■ ■ ■ ore defined in diTT ). and dm, dm+i, dm+2, 
. . . are defined in ( 132 1 ). 



4. Precomputations for the butterfly scheme 

In this section, we discuss the preprocessing required for the butterfly algorithm summarized in Subsection 13.21 
We will be using the notation detailed in Subsection l3.2l 

Perhaps the most natural organization of the computations required to construct the multilevel representation of 
an n X n matrix A is first to process all blocks having n rows (Level 1 in Figure [T] above), then to process all blocks 
having about n/2 rows (Level 2 in Figure [U, then to process all blocks having about n/4 rows (Level 3 in Figure[T]l, 
and so on. Indeed, [HT] uses this organization, which amounts to a "breadth-first traversal" of the control-flow graph 
for the program applying A to a vector (see, for example, |3] for an introduction to "breadth-first" and "depth-first" 
orderings). This scheme for preprocessing is efficient when the entries of A can be efficiently computed on-the-fly, 
individually. (Of course, we are assuming that A has a suitable rank property, that is, that there are positive real 
numbers e and C, and a positive integer k, such that any contiguous rectangular subblock of A containing at most Cn 
entries can be approximated to precision e by a matrix whose rank is k, using the Frobenius/Hilbert-Schmidt norm to 
measure the accuracy of the approximation. Often, taking C to be roughly proportional to k suffices, and ideally k and 
s are small.) If the entries of A cannot be efficiently computed individually, however, then the "breadth-first traversal" 
may need to store 0(rp-) entries at some point during the precomputations, in order to avoid recomputing entries of 
the matrix. 

If individual columns of A (but not necessarily arbitrary individual entries) can be computed efficiently, then 
"depth-first traversal" of the control-flow graph requires only 0{{k^/C)nlog{n)) floating-point words of memory at 
any point during the precomputations, for the following reason. We will say that we "process" a block of A to mean 
that we merge it with another, and split and recompress the result, producing a pair of new, unprocessed blocks. 
Rather than starting the preprocessing by constructing all blocks having n rows, we construct each such block only 
after processing as many blocks as possible which previous processing creates, but which have not yet been processed. 
Furthermore, we construct each block having n rows only after having akeady constructed (and possibly processed) 
all blocks to its left. To reiterate, we construct a block having n rows only after having exhausted all possibilities for 
both creating and processing blocks to its left. 

For each processed block B, we need only store the interpolation matrix B^^*) and the indices of the columns 
chosen for the ID; we need not store the k columns of B*** selected for the ID, since the algorithm for applying A (or 
its transpose) to a vector never explicitly uses any columns of a block that has been merged with another and split, but 
instead interpolates from (or anterpolates to) the shorter blocks arising from the processing. Conveniently, the matrix 
that we must store is small - no larger than k x i2k). For each unprocessed block B, we do need to store the 
k columns in Z?**' selected for the ID, in addition to storing Z?*^'^* and the indices of the columns chosen for the ID, 
facilitating any subsequent processing. Although Z?**' may have many rows, it has only as many rows as B and hence 
is smaller when B has fewer rows. Thus, every time we process a pair of tall blocks, producing a new pair of blocks 
having half as many rows, the storage requirements for all these blocks together nearly halve. By always processing 
as many already constructed blocks as possible, we minimize the amount of memory required. 

5. Spherical harmonic transforms via the butterfly scheme 

In this section, we describe how to use the butterfly algorithm to compute fast spherical harmonic transforms, via 
appropriate modifications of the algorithm of ifTsIl . 

We substitute the butterfly algorithm for the divide-and-conquer algorithm of ff] used in Section 3.1 of fl^, 
otherwise leaving the approach of 1 15] unchanged. Specifically, given numbers /3o, j3i, . . . , /3n-2, Pn-\, we use the 
butterfly scheme to compute the numbers q-q, a\, . . . , a„-2, ctn-x defined via the formula 

«-i 

-Yul^j^i K^iMd, (35) 

— m — 111 — 111 — 111 

for i = 0, 1, . . . , n - 2, n - 1, where m is a nonnegative mteger, P,„^2^ . . . , P,„+iii-2^ Pm+in '^he normalized asso- 
ciated Legendre functions of order m defined in (fT9l l. xq, x\, . . . , Xn^i, -^n-i are the positive zeros of P,„+2n from ( l23l l. 
and po, Pi, ■ ■ ■ , Pn-2, Pn-i are the corresponding quadrature weights from (|29] |. Similarly, given numbers ao, ai, 
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. . . , a„-2, o:„-i, we use the butterfly scheme to compute the numbers /So, Pi, . . . , [i„-2, Pn-\ satisfying ( |35] |. The fac- 
tors -y^, yjpi, ^Jp„-2, -\/p„-i ensure that the hnear transformation mapping Po, Pi, ■■ ■, Pn-2, Pn-\ to ao, ai, 
a„-2, <Xn-\ via (l35T l is unitary (due to ( fT9l l, (l29T l. and the orthonormaHty of the normaHzed associated Legendre 
functions on (-1, 1)), so that the inverse of the linear transformation is its transpose. 

Moreover, given numbers vq, vi, . . . , v„_2, v„_i, we use the butterfly scheme to compute the numbers yuo, 
fJ^n-i, l^n-\ defined via the formula 

n-l 

= ^ V; Vo7 K+ij+iiVi), (36) 

— m — w! — ifi — m 

for / = 0, 1, . . . , n - 2, n - 1, where m is a nonnegative integer, P,„+i, P„i+3, Pm+in-i^ Pm+in+i ^re the nor- 
malized associated Legendre functions of order m defined in ( fT9] l, yo, y\, ■■■ , yn-2, Jn-i are the positive zeros of 
Pm+2n+i from dZSl l. and (Tq, cr\, . . . , cr„_2, cr„_i are the corresponding quadrature weights from dSOl l. Similarly, given 
numbers /io, /^i, ■ ■ ■ , ^in-2, Mn-i, we use the butterfly scheme to compute the numbers vq, v\, . . ., v„_2, v„_i satisfy- 
ing ( |36] |. As above, the factors -y^, -\/ct7, . . . , ycr„_2, VovT ensure that the linear transformation mapping vq, vi, 
. . . , y„_2, to jiQ, Hi, . . . , /i„-2, fJ^n-i via (l36T l is unitary, so that its inverse is its transpose. 

Computing spherical harmonic transforms requires several additional computations, detailed (See also 

Remark lSTI below.) The butterfly algorithm replaces only the procedure described in Section 3.1 of [15]. 

In order to use (l35T l and (|36] | numerically, we need to precompute the positive zeros xq, xi, .. ., x„-2, x„^i of 
Pm+2n from ( |23] |. the corresponding quadrature weights po, pi, Pn-2, Pn-i from (l29T l, the positive zeros yo, yi, 
■ ■ ■ , Jn-a, Jn-i of Pm+2n+\ from (l25T l. and the corresponding quadrature weights ctq, crj, . . . , cr„_2, cr„_i from (|30] |. 
Section 3.3 of [15] describes suitable procedures (based on integrating the ordinary differential equation in (1211 1 in 
"Priifer coordinates"). We found it expedient to perform this preprocessing in extended-precision arithmetic, in order 
to compensate for the loss of a couple of digits of accuracy relative to the machine precision. 

To perform the precomputations described in Section]?] above associated with (l35l l and ( |36] |, we need to be able 

— m — — m — m 

to evaluate efficiently all n functions P^, P,„+2> ■■ Pm+2n-4, Pm+2n-2 at any of the precomputed positive zeros xq, xi, 
. . . , x„_2, x„-i of P,„+2n from ( i23T l. and, similarly, we need to be able to evaluate efficiently all n functions P,„+i, P,„+2, 
-Pm+2«-3' Pm+2n-\ at any of the precomputed positive zeros y(),yi, yn-2, yn-\ of Pm+2n+\ from ( l25T l. For this, 
we may use the recurrence relations ( l33l l and (l34l i. starting with the values of PJixi), P^^^iiyi), Pj^^2ixi), and P,„_|_3(y,) 
obtained via (fT9l l. (We can counter underflow by tracking exponents explicitly, in the standard fashion.) Such use of 
the recurrence is a classic procedure; see, for example. Chapter 8 of [1]. The recurrence appears to be numerically 
stable when used for evaluating normahzed associated Legendre functions of order m and of degrees at most m+2n-\, 
at these special points xq, xi, . . ., x„-2, x„-\ and yo, yi, . . . , yn-2, yn-i, even when n is very large. We did not need to 
use extended-precision arithmetic for this preprocessing. 



Remark 5.1. The formula (88) in u5\\ that is analogous to (l35T i of the present paper omits the factors -y^, -y^. 



\jPn-2, ylPn-\ included in ( l35l l. Obviously, the vectors 

(aQ,ai,...,a„-2,an-iY (37) 



and 




(38) 



differ by a diagonal transformation, and so we can obtain either one from the other efficiently. In fact, the well- 



conditioned matrix A from Section 3.1 of 11511 represents the same diagonal transformation, mapping (iJTl l to (l38T l. 
Similar remarks apply to (l36] l. of course. 



6. Numerical results 

In this section, we describe the results of several numerical tests of the algorithm of the present paper. (Computing 
spherical harmonic transforms requires several additional computations, detailed in [ 15] — see Section]5]for further 
information. The butterfly algorithm replaces only the procedure described in Section 3. 1 of ifisil .) 
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Tables [TJ@]report the results of computing from real numbers ySo, y6i, . . . , fin-2, Pn-\ the real numbers uq, a\, 
a„_2, ffn-i defined by the formula 

n-l 

= ^ P^^iMd (39) 

for i - 0, I, . . . , n - 2, n - I, where P^, ■ ■ ■ ' Pm+2n-2^ Pm+in ^i"^ the normalized associated Legendre functions 

defined in (fT9T l. xq, xi, . . . , x„_2, x„_i are the positive zeros of P"'„+2n from ( l23l l. and po. Pi, ■ ■ ■ , Pn-i are the 
corresponding quadrature weights from d29] l. We will refer to the map via (|39] l from ySo, ySi, . . . , /3„-2, Ai-i to oro, ci, 
. . . , ccn-i, ffn-i as the forward transform, and the map via (l39l l from ffo, ■ ■ ctn-i, Q^n-i to y6o, ySi, . . . , y6„_2, /?n-i 
as the inverse transform (the inverse is also the transpose, due to ( fT9] l, (|29] |, and the orthonormality of the normalized 
associated Legendre functions on (-1, 1)). The values of m differ in Tables 1-2 and 3-4. 

Tables |5] and |6] report the results of computing from real numbers vq, vi, . . . , v„_2, v„_i the real numbers fiQ, fii, 
Hn-\ defined by the formula 

— m 

i^i ^ P,„+2j+i(yi) (40) 

./=o 

— f)i — in — in — m 

for / = 0, 1 , . . . , n - 2, « - 1 , where j , P,„^3 , . . . , P,n+2n- 1 ' Pm+2n+ 1 ^^e the normaUzed associated Legendre functions 
defined in ( fT9b . yo, yi, . . . , }'„-2, Jn-i are the positive zeros of P,„+2n+i from (l25T l, and o-q, crj, . . . , cr„_2, cr„_i are the 
corresponding quadrature weights from (l30l l. We will refer to the map via (l40t from vq, vi, . . . , v„_2, v„_i to /io, yUi, 
. . . , /i„-2, /^n-i as the forward transform, and the map via (l40l i from /l/q. A*!, ■ ■ ■ , A'„_2, /^n-i to vq, vi, . . . , v„_2, v„_i as 
the inverse transform (as above, the inverse is also the transpose). 

For the test vectors /? and v whose entries appear in (|39] | and (|40] |. we used normalized vectors whose entries were 
pseudorandom numbers drawn uniformly from (-1,1), normalized so that the sum of the squares of the entries is 1. 

As described in Remark IT?! we compute for each block in the multilevel representation of A an ID whose rank is 
as small as possible while still approximating the block to nearly machine precision. 

The following list describes the headings of the tables: 

• n is the size of the transform, the size of the vectors a and /3 whose entries are given in ( |39] |. and of the vectors 
yu and V whose entries are given in (l40l i. 

• m is the order of the normalized associated Legendre functions used in ( |39] l and ( l40l l. 

• ^max is the maximum of the ranks of the IDs for the blocks in the multilevel representation. 

• k^yg is the average of the ranks of the IDs for the blocks in the multilevel representation. 

• ka- is the standard deviation of the ranks of the IDs for the blocks in the multilevel representation. 

• fdii is the time in seconds required to apply an « x n matrix to an n x 1 vector using the standard procedure. We 
estimated the last two entries for fdh by multiplying the third-to-last entry in each table by 4 and 16, since the 
large matrices required to generate those entries cannot fit in the available 2 GB of RAM. We indicate that these 
entries are estimates by enclosing them in parentheses. 

• ffwd is the time in seconds required by the butterfly algorithm to compute the forward transform via ( [39] l or ( l40l l. 
mapping from ySo, ySi, ■ ■ ■ , Ai-i to ao, au ■ ■ ■ , a„-2, a„-u or from vq, vi, . . . , v„_2, v„_i to //q, jUi, . . . , 

• finv is the time in seconds required by the butterfly algorithm to compute the inverse transform via (|39] | or (|40] |. 
mapping from ao, ffi, . . . , Q'„-2, to j6o, f5u . . . , yS„_2, or from yuo, fn, yU„_2, /U„-i to vq, vi, . . . , 

V„-2, 

• fquad is the time in seconds required in the precomputations to compute the quadrature nodes xo, xi , . . . , 

Xn-2, x„-i or yo, yu yn-2, yn-i, and weights po, pi, . . . , p„-2, P«-i or cry, cri, cr„_2, cr„_i, from 
or dinil, used in ^ and gOll. 
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• fcomp is the time in seconds required to construct the compressed muhilevel representation used in the butterfly 
algorithm, after having already computed the quadrature nodes xq, xi, . . ., x„_2, x„-i or yo, yi, . . . , y„-2, yn-\, 
and weights po, pi, . . . , Pn-2, Pn-i or ctq, cti, . . ., cr„_2, cTn-i, from (l29T l or ( l30t . used in (l39l l and ( l40l i. 

• wjnjax is the maximum number of floating-point words of memory required to store entries of the transform 
matrix during any point in the precomputations (all other memory requirements are negligible in comparison). 

• fifwd is the maximum diflference between the entries in the result of the forward transform computed via the 
butterfly algorithm and those computed directly via the standard procedure for applying a matrix to a vector. 
(The result of the forward transform is the vector a whose entries are given in ( l39b or the vector fj. whose entries 
are given in (l40l i.) 

• Sinv is the maximum difference between the entries in a test vector and the entries in the result of applying to the 
test vector first the forward transform and then the inverse transform, both computed via the butterfly algorithm. 
(The result of the forward transform is the vector a whose entries are given in ( l39b or the vector yU whose entries 
are given in ( |40] |. The result of the inverse transform is the vector y6 whose entries are given in ( |39] | or the vector 
V whose entries are given in (|40]|.) Thus, Sinv measures the accuracy of the butterfly algorithm without reference 
to the standard procedure for applying a matrix to a vector (unlike Sfwd)- 

For the first level of the multilevel representation of the n x n matrix, we partitioned the matrix into blocks each 
dimensioned n x 60 (except for the rightmost block, since n is not divisible by 60). Every block on every level has 
about the same number of entries (specifically, 60n entries). We wrote all code in Fortran 77, compiling it using the 
Lahey-Fujitsu Linux Express v6.2 compiler, with optimization flag — o2 enabled. We ran all examples on one core 
of a 2.7 GHz Intel Core 2 Duo microprocessor with 3 MB of L2 cache and 2 GB of RAM. As described in Section|5] 
we used extended-precision arithmetic during the portion of the preprocessing requiring integration of an ordinary 
differential equation, to compute quadrature nodes and weights (this is not necessary to attain high accuracy, but does 
yield a couple of extra digits of precision). Otherwise, our code is compliant with the IEEE double-precision standard 
(so that the mantissas of variables have approximately one bit of precision less than 16 digits, yielding a relative 
precision of about .2E-15). 

Remark 6.1. Tables [T][3l and |5] indicate that the linear transformations in ( l39l l and ( l40b satisfy the rank property 
discussed in Subsection l3.2l with arbitrarily high precision, at least in some averaged sense. Furthermore, it appears 
that the parameter k discussed in Subsection l3.2l can be set to be independent of the order m and size n of the transforms 
in (39[ and ( l40t . with the parameter C discussed in Subsection l3.2l roughlv proportional to k. The acceleration provided 
by the butterfly algorithm thus is sufficient for computing fast spherical harmonic transforms, and is competitive 
with the approach taken in [15] (though much work remains in optimizing both approaches in order to gauge their 



relative performance). Unlike the approach taken in 01511 . the approach of the present paper does not require the 



use of extended-precision arithmetic during the precomputations in order to attain accuracy close to the machine 
precision, even while accelerating spherical harmonic transforms about as well. Moreover, the butterfly can be easier 
to implement. 

Remark 6.2. The values in Tables |2] |4] and |6] vary with the size n of the transforms in (39[ and ( l40l i as expected. 
The values for fquad are consistent with the expected values of a constant times n. The values for fcomp are consistent 
with the expected values of a constant times (with the constant being proportional to ^avg)- The values for mmax are 
consistent with the expected values of a constant times n log(«) (again with the constant being proportional to A;avg); 
these modest memory requirements make the preprocessing feasible for large values of n such as those in the tables. 

Remark 6.3. In the current technological environment, neither the scheme of ifisll nor the approach of the present 
paper is uniformly superior to the other For example, the theory from flS*] is rigorous and essentially complete, 
while the theory of the present article ideally should undergo further development, to prove that the rank properties 
discussed in Remark lSTI are as strong as numerical experiments indicate, yielding the desired acceleration. In contrast, 
to attain accuracy close to the machine precision, the approach of [Ts'l requires the use of extended-precision arithmetic 
during its precomputations, whereas the scheme of the present paper does not. Implementing the procedure of the 
present article can be easier Finally, an anonymous referee kindly compared the running-times of the implementations 



11 



reported in BlSll and the present paper, noticing that the newer computer system used in the present article is about 
2.7 times faster than the old system; the algorithms of 11511 and of the present article are roughly equally efficient — 
certainly neither appears to be more than twice faster than the other. However, both implementations are rather crude, 
and could undoubtedly benefit from further optimization by experts on computer architectures; also, we made no 
serious attempt to optimize the precomputations. Furthermore, with the advent of multicore and distributed processors, 
coming changes in computer architectures might affect the two approaches differently, as they may parallelize and 
utilize cache in different ways. In the end, the use of one approach rather than the other may be a matter of convenience, 
as the two methods are yielding similar performance. 



7. Conclusions 

This article provides an alternative means for performing the key computational step required in [ 15] for comput- 
ing fast spherical harmonic transforms. Unlike the implementation described in 1 15] of divide-and-conquer spectral 
methods, the butterfly scheme of the present paper does not require the use of extended precision during the compres- 
sion precomputations in order to attain accuracy very close to the machine precision. With the butterfly, the required 
amount of preprocessing is quite reasonable, certainly not prohibitive. 

Unfortunately, there seems to be little theoretical understanding of why the butterfly procedure works so well for 
associated Legendre functions (are the associated transforms nearly weighted averages of Fourier integral operators?). 
Complete proofs such as those in ifTlll and ifisll are not yet available for the scheme of the present article. By con- 
struction, the butterfly enables fast, accurate applications of matrices to vectors when the precomputations succeed. 
However, we have yet to prove that the precomputations will compress the appropriate nxn matrix enough to enable 
applications of the matrix to vectors using only 0{n log(n)) floating-point operations (flops). Nevertheless, the scheme 
has succeeded in all our numerical tests. We hope to produce rigorous mathematical proofs that the precomputations 
always compress the matrices as much as they did in our numerical experiments. 

The precomputations for the algorithm of the present article require (9(n^) flops. The precomputations for the 
algorithm of |15] also require (9(n^) flops as implemented for the numerical examples of that paper; however, the 
procedure of [7] leads naturally to precomputations for the approach of [ 15] requiring only 0{n log(n)) flops (though 
these "more efficient" precomputations do not become more efficient in practice until n is absurdly large, too large 
even to estimate reliably). We do not expect to be able to accelerate the precomputations for the algorithm of the 
present article without first producing the rigorous mathematical proofs mentioned in the previous paragraph. Even 
so, the current amount of preprocessing is not unreasonable, as the numerical examples of Section|6]illustrate. 
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Table 1: Ranks and running-times for even degrees 



n 


m 


^max 


^avg 


kg- 


^dir 


ffwd 


^inv 


1250 


1250 


170 


65.3 


29.5 


.25E-2 


.18E-2 


.15E-2 


2500 


2500 


168 


67.0 


32.6 


.98E-2 


.46E-2 


.38E-2 


5000 


5000 


195 


70.5 


35.9 


.39E-1 


.12E-1 


.96E-2 


10000 


10000 


247 


73.5 


38.7 


.15E-0 


.28E-1 


.23E-1 


20000 


20000 


308 


75.9 


41.5 


(.60E-0) 


.67E-1 


.55E-1 


40000 


40000 


379 


78.0 


43.8 


(.24E-I-1) 


.16E-0 


.13E-0 




Table 2: Precomputation times, memory requirements, and accuracies for even degrees 




n 


m 


^quad 




^comp 


^max 


Sfwd 


^inv 


1250 


1250 


.46E2 




.96E0 


.86E6 


.62E-14 


.19E-13 


2500 


2500 


.92E2 




.41E1 


.20E7 


.37E-14 


.25E-13 


5000 


5000 


.18E3 




.17E2 


.50E7 


.59E-14 


.43E-13 


10000 


10000 


.37E3 




.82E2 


.14E8 


.32E-14 


.57E-13 


20000 


20000 


.74E3 




.39E3 


.29E8 


.30E-14 


.88E-13 


40000 


40000 


.15E4 




.17E4 


.64E8 


.24E-14 


.13E-12 






Table 3: Ranks and running-times for even degrees 






n 


m 


J- 

'^max 


^avg 




fdir 


ffwd 




1250 





110 


67.0 


23.3 


.25E-2 


.18E-2 


.15E-2 


2500 





110 


70.0 


25.0 


.98E-2 


.48E-2 


.40E-2 


5000 





111 


73.9 


26.1 


.39E-1 


.12E-1 


.lOE-1 


10000 





111 


77.3 


26.8 


.15E-0 


.29E-1 


.24E-1 


20000 





112 


80.2 


27.1 


(.60E-0) 


.68E-1 


.57E-1 


40000 





169 


82.7 


27.4 


(.24E-I-1) 


.16E-0 


.13E-0 




Table 4: Precomputation times, memory requirements, and accuracies for even degrees 




n 


m 


^quad 




^comp 


'^max 


fifwd 




1250 





.44E2 




.96E0 


.86E6 


.49E-14 


.12E-12 


2500 





.88E2 




.41E1 


.20E7 


.35E-14 


.14E-12 


5000 





.18E3 




.18E2 


.51E7 


.23E-14 


.35E-12 


10000 





.36E3 




.82E2 


.14E8 


.18E-14 


.63E-12 


20000 





.74E3 




.40E3 


.29E8 


.20E-14 


.22E-11 


40000 





.14E4 




.18E4 


.66E8 


.16E-14 


.37E-11 



13 



Table 5: Ranks and ranning-times for odd degrees 



n 


m 


^max 


^avg 




^dir 




^inv 


1250 


1250 


170 


65.3 


29.5 


.25E-2 


.17E-2 


.15E-2 


2500 


2500 


169 


67.0 


32.6 


.98E-2 


.48E-2 


.39E-2 


5000 


5000 


196 


70.5 


35.9 


.39E-1 


.12E-1 


.97E-2 


10000 


10000 


247 


73.5 


38.7 


.15E-0 


.28E-1 


.24E-1 


20000 


20000 


308 


75.9 


41.4 


(.60E-0) 


.68E-1 


.56E-1 



40000 40000 379 78.0 43.8 (.24E+1) .16E-0 .13E-0 



Table 6: Precomputation times, memory requirements, and accuracies for odd degrees 



n 


m 


^quad 


^comp 


'^max 


Sfwd 


^inv 


1250 


1250 


.46E2 


.94E0 


.86E6 


.41E-14 


.19E-13 


2500 


2500 


.92E2 


.40E1 


.20E7 


.41E-14 


.29E-13 


5000 


5000 


.18E3 


.17E2 


.50E7 


.40E-14 


.51E-13 


10000 


10000 


.37E3 


.80E2 


.14E8 


.31E-14 


.62E-13 


20000 


20000 


.74E3 


.39E3 


.29E8 


.34E-14 


.lOE-12 


40000 


40000 


.15E4 


.18E4 


.64E8 


.25E-14 


.14E-12 
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